Abstract

In this paper, we study the large-scale protein interaction network of yeast utilizing a stochastic
method based upon percolation of random graphs. In order to find the global features of
connectivities in the network, we introduce numerical measures that quantify (1) how strongly a
protein ties with the other parts of the network and (2) how significantly an interaction
contributes to the integrity of the network. Our study shows that the distribution of essential
proteins is distinct from the background in terms of global connectivities. This observation
highlights a fundamental difference between the essential and the non-essential proteins in the
network. Furthermore, we find that the interaction data obtained from different experimental
methods, such as immunoprecipitation and two-hybrid techniques, possess different characteristics.
We discuss the biological implications of these observations.

I. INTRODUCTION



Recent availability of a large amount of data from high-throughput experiments [1-5] has 
brought about a fundamental change in the way we study biological systems. Unlike the 
traditional methods which relied on probing a single or a few proteins to identify important 
pathways, it is now becoming possible to describe larger functional 'modules' [6] and even 
the global properties of the entire proteome [7-10]. Researchers are attempting to connect 
large-scale protein interaction data with information from phenotype studies [7, 8]. In 
one such analysis of data from yeast, Jeong et al. observed that the connectivities of individual
proteins in the network closely follow a power-law distribution. As in other power-law
networks, a positive correlation was found between a protein's inviability and its connectivity [7].
In another study, Maslov et al. observed interesting patterns in the distribution of the links 
between the nearest neighbors in the network and postulated that such patterns give rise to 
the specificity and the robustness of the network [8]. 

One of the shortcomings of the previous approaches is that they drew conclusions about 
the global nature of the network from its local connectivity properties. It is unclear whether 
such local studies based on individual nodes or nearest neighbors fully capture the global 
picture of the network. For example, some essential proteins, namely, those for which null
mutants produce inviable strains [11], may have few direct links but still play
important roles in the network through the proteins to which they are connected. Such
proteins would not be correctly identified by just counting the number of links as in Ref. [7].
To properly recognize such cases, it is necessary to go beyond the nearest-neighbor links.
However, it is not clear that the techniques mentioned above can easily be extended to 
answer such questions. 

In this paper, we introduce a stochastic method inspired by the percolation model in
statistical mechanics [12] that overcomes the shortcomings of the previous approaches. This
method allows us to define a quantity that measures the correlation between any two nodes 
in the network, taking the topology of the entire network into account. Biologically, such 
correlations describe the direct and indirect influences of one protein on another through 
the protein interaction network. If such correlations indeed carry biological significance, we
expect the essential proteins to be highly correlated, in general, with the rest of the
network. One of our main results is that most essential proteins do possess higher correlations
between themselves and the rest of the network. This is consistent with previous results [7],
because, to first order, the correlations we compute are proportional to the
connectivities of the proteins. However, we show that it is important to go beyond the first order.
Identifying essential proteins by our method performs consistently better than just counting
links. Additionally, we observe that the essential proteins interact more tightly with the
other essential proteins, thus forming a 'network core'. This agrees with large-scale
experiments probing protein networks [4].

Based on our method, we can also quantify the relative significance of an interaction to
the integrity of the network. We observe that the interaction data from different
measurement techniques, such as immunoprecipitation (IP) and the two-hybrid test, give distinct
distributions. This suggests that various experimental techniques for probing the protein
interaction might explore different regions of the network.

II. METHOD AND MATERIALS 
A. Bond-percolation on Graph 

Given any two nodes in a network, the strength of their connectivity can be estimated in 
different ways. Some of these measures are local. For example, we can ask whether any two
nodes are directly linked, how many common neighbors they share [13], etc. We can also ask
how local properties of a node, such as its degree, relate to its function and
its importance in the network [7]. Furthermore, information about the correlations between
nodes involving nonlocal properties, such as the length of the shortest path and clustering 
structures, will enable us to uncover hidden features buried within the massive data. Here, 
we present a generic approach that extracts useful information about a node beyond its local 
connections. 

Correlations between two nodes may come from numerous other short paths rather than
just the shortest path. A reasonable estimate of correlation should take into account the
number and lengths of the different paths between two nodes. One possible way to estimate
such a correlation between two nodes is to repeatedly remove a randomly chosen fraction q of
the links in the network and check whether the two nodes remain connected. The
probability that they remain connected increases with the number of short paths between them
and decreases with the length of those paths. This probability provides a good
measurement of the correlation between two nodes that includes information about
the non-local topology of the network. The described process of finding the correlation
between two nodes in a network is equivalent to the bond-percolation model in statistical
mechanics [12].

Mathematically, a network is treated in the language of graph theory, where a node is
denoted as a vertex and a link as an edge. Given a graph G with vertices V and edges E,
a percolation configuration is realized as follows. Each edge linking vertices i and j is
assigned a random number p_ij distributed uniformly from 0 to 1. If this random number is
greater than p = 1 - q, a given percolation probability, then the edge is eliminated from the
original graph. The final graph G' consists of the edge set E' = E \ Ẽ, where Ẽ is the set of
edges with p_ij > p, so that E' consists of those edges with p_ij ≤ p. Assuming that G is
connected, the reduced graph G' may or may not remain a single connected component, depending
on p.
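The realization step described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions rather than the original implementation: the function name `percolate` and the edge-list representation are hypothetical choices made here.

```python
import random


def percolate(edges, p, rng=None):
    """Return the edges kept in one percolation realization G'.

    Each edge draws a random number p_ij uniformly from [0, 1) and is
    kept when p_ij < p, so it is removed with probability q = 1 - p.
    (Using '<' instead of '<=' only differs on a set of measure zero.)
    """
    rng = rng or random.Random()
    return [edge for edge in edges if rng.random() < p]


# Example: a 4-vertex path graph thinned at p = 0.5.
path_edges = [(0, 1), (1, 2), (2, 3)]
kept_edges = percolate(path_edges, 0.5, random.Random(1))
```

Passing a seeded `random.Random` instance makes a realization reproducible, which is convenient when comparing different values of p on the same graph.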

B. Susceptibility 

The first step in applying the algorithm is to determine the appropriate value of the 
probability p. If p is near one, then we only produce totally connected graphs. If p is 
too close to zero, then the network is split into individual vertices and small clusters. An 
intermediate value of p provides information about the non-local properties of the network. 

The degree of fragmentation in the graph G' can be quantified by the order parameter
m(p), the ratio of the size of the largest connected component to the total graph size. It is
defined as m(p) = N_max/|V|, where N_max is the number of vertices in the largest connected
component and |V| is the total number of vertices. For a connected graph G, m(p) varies from
1/|V| to 1 as p changes from 0 to 1. Here, m is a stochastic variable, whose fluctuation χ(p) is
defined by



The brackets denote the ensemble average, which is the average over many different
realizations of G'. The curve of χ(p) reveals certain aspects of the graph topology. For example,
if G is a regular two-dimensional square lattice, then χ diverges with a power-law behavior
as a function of p - p_c, for p_c = 1/2. For other types of regular lattices, like triangular
lattices or higher-dimensional lattices, p_c and/or the power-law exponent also change. A
maximum in χ(p) occurs at the transition point p_c, indicating a phase transition and critical
behavior [12]. At this critical point, the distribution of the sizes of the connected clusters
decays as a power law. Choosing a value of p near this critical value, we get the most non-local
information regarding the network.
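A scan over p of the kind described in this section might look as follows. This is a sketch only: it assumes the fluctuation is taken as the plain variance ⟨m²⟩ - ⟨m⟩² (the normalization used in the original analysis may differ), and the function names are hypothetical.

```python
import random
from statistics import mean, pvariance


def largest_component_fraction(n, edges):
    """Order parameter m = N_max / |V| for vertices 0..n-1 (union-find)."""
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for i, j in edges:
        root_i, root_j = find(i), find(j)
        if root_i != root_j:
            parent[root_i] = root_j
    sizes = {}
    for v in range(n):
        root = find(v)
        sizes[root] = sizes.get(root, 0) + 1
    return max(sizes.values()) / n


def susceptibility_curve(n, edges, ps, realizations=200, seed=0):
    """Return (p, <m>, variance of m) for each p in ps."""
    rng = random.Random(seed)
    curve = []
    for p in ps:
        ms = []
        for _ in range(realizations):
            kept = [e for e in edges if rng.random() < p]
            ms.append(largest_component_fraction(n, kept))
        # Assumption: fluctuation measured as <m^2> - <m>^2.
        curve.append((p, mean(ms), pvariance(ms)))
    return curve
```

The value of p at which the third entry peaks is the candidate working point; for a large network the number of realizations per p would need to be increased until the peak position is stable.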

C. Correlations and the definition of v_i

Whether two arbitrary vertices i and j remain connected in G' can provide more detailed
information about G. If two vertices retain their connection, it means that there exist paths
in E' from vertex i to vertex j. Define δ_ij as a function of a pair of vertices i and j such that
δ_ij = 1 if vertices i and j are connected, and δ_ij = 0 otherwise. The percolation correlation
is then defined as the ensemble average of δ_ij,

c_{ij} = \langle \delta_{ij} \rangle.    (2)

With knowledge of the c_ij, we are equipped to measure how strongly a vertex i links to
the rest of the network, counting both direct and indirect connections to vertex i. We define
the quantity v_i for vertex i,

v_i = \frac{1}{|V|} \sum_{j \in V} c_{ij}.    (3)

This value is sensitive not only to the linking degree at each vertex but also to higher-order
connections between a vertex and the rest of the random graph. Thus, v_i effectively ranks
the importance of a vertex in the graph. Intuitively, v_i may be interpreted as the fraction
of other vertices to which vertex i remains linked, if each edge is broken with probability
q = 1 - p in the graph G. In Fig. 1, we show the descending ranking order of the v_i for a
small graph.
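The definition of v_i lends itself to a direct Monte Carlo estimate: in each realization, the sum of δ_ij over j is simply the size of the connected component containing i. The sketch below assumes that the self term δ_ii = 1 is included in the sum over j ∈ V; the function names are hypothetical.

```python
import random


def component_labels(n, edges):
    """Label each vertex 0..n-1 with the root of its connected component."""
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for i, j in edges:
        root_i, root_j = find(i), find(j)
        if root_i != root_j:
            parent[root_i] = root_j
    return [find(v) for v in range(n)]


def estimate_v(n, edges, p, realizations=1000, seed=0):
    """Monte Carlo estimate of v_i = (1/|V|) * sum_j c_ij for every vertex."""
    rng = random.Random(seed)
    totals = [0.0] * n
    for _ in range(realizations):
        kept = [e for e in edges if rng.random() < p]
        labels = component_labels(n, kept)
        sizes = {}
        for root in labels:
            sizes[root] = sizes.get(root, 0) + 1
        for v, root in enumerate(labels):
            # sum_j delta_vj equals the size of v's component
            # (assumption: the self term delta_vv = 1 is counted).
            totals[v] += sizes[root] / n
    return [t / realizations for t in totals]


# Example: in a star graph the hub should rank above every leaf.
star = [(0, 1), (0, 2), (0, 3), (0, 4)]
v = estimate_v(5, star, p=0.5)
```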

D. The definition of β_ij

Using a similar idea, we can define a quantity that allows us to check the influence of 
an edge on the graph integrity. The elimination of some edges may fundamentally change 
the connectivity properties whereas the graph topology may be relatively unchanged against 
the deletion of others. For example, for a small fully connected subgraph, termed a clique,
removal of a certain number of edges between the vertices of the subgraph tends not to
separate the graph into disconnected pieces. Individual links in the subgraph do not play
crucial roles in supporting the integrity of the subgraph and the whole graph. We define the
quantity β_ij to monitor the importance of edge e_ij to the integrity of the graph,

\beta_{ij} = \frac{1}{|V|^2} \sum_{l,m \in V} \left( c_{lm}(G' \cup \{e_{ij}\}) - c_{lm}(G' \setminus \{e_{ij}\}) \right).    (4)

The first term in the summation is the correlation c_lm measured by adding e_ij to G',
independent of p_ij and p. The second term is c_lm measured by removing e_ij from G'. The
difference in the measurement of c_lm in the presence or absence of edge e_ij allows us to
distinguish edges. For example, if e_ij bridges two clusters, then β_ij will be elevated (note
the edges 1, 2 and 3 in Fig. 1). Suppose edge e_ij connects two disjoint connected components A
and B with sizes n_A and n_B. Then, in a realization of G', the contribution to β_ij is the
difference between Σ_{l,m ∈ A∪B} δ_lm = |n_A + n_B|^2 and
Σ_{l,m ∈ A} δ_lm + Σ_{l,m ∈ B} δ_lm = |n_A|^2 + |n_B|^2. Namely, the
contribution to β_ij is proportional to n_A n_B. However, if e_ij is embedded within a connected
component such that adding or removing e_ij does not perturb the component's connectivity,
then e_ij is redundant and does not contribute to β_ij. With this interpretation, β_ij measures
how well e_ij succeeds in connecting different big components or modules.
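The component-size argument above suggests a simple estimator for a single edge: percolate all other edges, and whenever the two endpoints fall into different components A and B, add (n_A + n_B)² - n_A² - n_B² = 2 n_A n_B, normalized by |V|². The sketch below is an illustration under that reading of Eq. (4), assuming a simple graph in which each edge is listed once; `estimate_beta` and `component_labels` are hypothetical names.

```python
import random


def component_labels(n, edges):
    """Label each vertex 0..n-1 with the root of its connected component."""
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for i, j in edges:
        root_i, root_j = find(i), find(j)
        if root_i != root_j:
            parent[root_i] = root_j
    return [find(v) for v in range(n)]


def estimate_beta(n, edges, target, p, realizations=1000, seed=0):
    """Monte Carlo estimate of beta_ij for the edge `target` = (i, j)."""
    rng = random.Random(seed)
    i, j = target
    # The state of the target edge itself is irrelevant: it is forced
    # present in the first term and absent in the second.
    others = [e for e in edges if e != target]
    total = 0.0
    for _ in range(realizations):
        kept = [e for e in others if rng.random() < p]
        labels = component_labels(n, kept)
        if labels[i] != labels[j]:
            n_a = labels.count(labels[i])
            n_b = labels.count(labels[j])
            # Adding the edge merges A and B: gain of 2 * n_A * n_B pairs.
            total += 2 * n_a * n_b / n ** 2
    return total / realizations


# Example: two triangles joined by the bridge (2, 3).
two_triangles = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
beta_bridge = estimate_beta(6, two_triangles, (2, 3), p=0.6)
beta_inner = estimate_beta(6, two_triangles, (0, 1), p=0.6)
```

In this toy graph the bridge should receive a clearly larger score than an edge inside a triangle, mirroring the behavior described for cliques and bridging edges.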

E. Protein interaction data 

Here, we apply the described method to the yeast protein interaction data taken from the
Database of Interacting Proteins (DIP) [14]. The dataset contains 14871 interactions between
4692 proteins [17] and includes interactions measured by different experimental methods. We
treat the interaction network as an undirected graph, with the proteins as vertices. If two 
proteins are interaction partners in the dataset, the corresponding vertices are joined by an 
edge.

FIG. 1: We applied our algorithm with p = 0.43 to a small graph. The vertices are indexed in
descending order of v_i, and the parenthesized numbers indicate the degree of connection. Some
vertices, like vertex 3, have few neighbors but outrank, in terms of v_i, other vertices with
more neighbors. Vertices with the same degree of connectivity might be ranked very differently
because they have differing numbers of next-nearest neighbors. The eighteen edges with the
largest β_ij are shown in gray and are ranked. If we remove these edges, the graph is severed
into several compact subgraphs. The edges carrying the largest β_ij tend to link different large
components. The edges within a clique, like the one formed by vertices 5, 4, 9, 13, and 14, have
the smallest β_ij.

III. RESULTS AND DISCUSSIONS 

A. Determination of p 

As a first step in applying this stochastic method on the protein interaction network, we 
need to determine the appropriate value of p. If p is near one, then we will only produce 
totally connected graphs. If p is too close to zero, then we will only obtain information about 
small clusters. Some intermediate value of p will give us global properties of the network. 

In order to determine the proper value of p, we need to compute the curve χ(p). Such a
curve for the DIP data is shown in Fig. 2. The curve peaks at about p = 0.07, where the size
fluctuations of the largest cluster are maximal. Most realizations of the percolation graph
G' in the neighborhood of this peak yield sparse but still predominantly connected graphs.
Accordingly, computing v_i and β_ij around this peak in χ(p) avoids the finite-size effect at
smaller p and the loss of resolution at larger p.

FIG. 2: Susceptibility curve of the parameter m. The curve peaks at p = 0.07, where the
fluctuations of m are greatest.



B. Distribution of v_i

We gathered our data from 10^5 realizations of the graph at p = 0.07. The distribution of
log(v_i) for the protein interaction network is shown in Fig. 3. We also report the distribution
for the subset comprising only the essential proteins [18]. The distribution of v_i for essential
proteins differs significantly from the background distribution and is biased toward greater
v_i. A protein with a greater v_i ties to the network more strongly than a protein possessing
a smaller v_i. Therefore, we would predict that removing a protein with a greater v_i from
yeast harms more biologically important pathways and would thereby be more likely to destroy
viability. The percentage of proteins having a given v_i which are essential, that is, (number of
essential proteins of a given v_i)/(number of proteins of the given v_i), is shown in Fig. 4.
This percentage has a strong positive correlation with v_i, in agreement with the prediction.

FIG. 3: Histogram of log(v_i). The distribution of v_i for essential proteins is skewed toward
larger v_i.

FIG. 4: The percentage of proteins which are essential as a function of v_i.

What are the specific connectivity properties that produce a large v_i for a specific protein?
To a first-order approximation, v_i is proportional to the degree of connectivity k_i of the i-th
protein. Since a protein with k interactions is usually connected to at least p·k proteins,
to first order v_i is proportional to k_i. However, the protein interaction network displays
small-world properties [19]. Therefore, the correction to v_i from higher-order connections
should be included. For example, if the number of next-nearest neighbors of a protein is
much greater than the number of nearest neighbors, then the contribution from the
next-nearest neighbors is comparable to that of the nearest neighbors. In such a case, the
proteins with the same k_i have a broad distribution of v_i, as in our results. The value of v_i
gives more extensive information about the protein's connectivity in the network beyond that
of its nearest neighbors.
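As a rough worked check of the first-order picture (a sketch under stated assumptions, not a calculation taken from the original analysis): if only direct neighbors are counted, each direct edge survives with probability p, and the self term is ignored, then c_ij ≈ p for each of the k_i neighbors, which gives

```latex
v_i^{(1)} \approx \frac{p\,k_i}{|V|}
\quad\Longrightarrow\quad
v_{\mathrm{SRP1}}^{(1)} \approx \frac{0.07 \times 196}{4692} \approx 0.0029 .
```

Here k_i = 196 is the degree listed for SRP1 in Table II, |V| = 4692 is the number of proteins in the dataset, and p = 0.07 is assumed to be the value at which Table II was computed. The listed value v_i = 0.0623 is roughly twenty times larger than this first-order estimate, which illustrates how much of v_i can come from connections beyond the nearest neighbors.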

Our method is advantageous because we can identify important proteins that might
otherwise not be considered significant because they have a lower first-order interaction degree.
Such proteins probably control other essential proteins through a few critical interactions.
To illustrate the power of this approach compared to merely counting the nearest-neighbor
degree of interactions, we rank the proteins by v_i and compare the result to the ranking by
k_i (see Table I). For example, 61% of the proteins in the top 2% of v_i are essential, whereas
only 52% of the proteins in the top 2% of k_i are required for viability. Such a result suggests
that the essential proteins with higher v_i not only have more interactions but are also more
likely to interact with other proteins that also tend to be essential. A similar
observation has been reported by Gavin et al. [4], and our independent evidence supports
their hypothesis.
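The comparison between the two rankings reduces to one small computation: sort the proteins by a score and count how many of the top fraction are essential. The sketch below uses hypothetical names and a toy input; how ties and the rounding of the cutoff were handled in the original analysis is not stated, so both are assumptions here.

```python
def essential_fraction_in_top(scores, essential, fraction):
    """Fraction of essential proteins among the top-scoring proteins.

    scores    : dict mapping protein name -> score (e.g. v_i or k_i)
    essential : set of protein names known to be essential
    fraction  : top fraction to inspect, e.g. 0.02 for the top 2%
    """
    ranked = sorted(scores, key=scores.get, reverse=True)
    # Assumption: the cutoff is rounded to the nearest integer, at least 1.
    top_n = max(1, round(fraction * len(ranked)))
    top = ranked[:top_n]
    return sum(1 for protein in top if protein in essential) / top_n


# Toy example with made-up scores (not data from the paper).
toy_scores = {"a": 3.0, "b": 2.0, "c": 1.0, "d": 0.5}
toy_essential = {"a", "c"}
top_half = essential_fraction_in_top(toy_scores, toy_essential, 0.5)
```

Calling the same function once with the v_i values and once with the degrees k_i gives the two columns being compared.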



All proteins      Essential proteins
Percentile        by v_i    by k_i    by v_i (randomized)
2%  (94)          61%       52%       53%
5%  (234)         53%       47%       50%
10% (469)         48%       46%       48%
25% (1173)        39%       38%       38%



TABLE I: The percentage of essential proteins in selected percentiles ranked by v_i and by the
degree of connection k_i. Among the top 2% of proteins ranked by v_i, 61% are essential, while
only 52% of the top 2% ranked by k_i are essential. The third column is a control in which the v_i
are recalculated for a (quasi-)randomized graph in which edges have been swapped while retaining
the degrees of connection of all vertices in the original graph. Identifying essential proteins by
calculating v_i performs consistently better than only computing k_i, demonstrating the
significance of nonlocal structure beyond that of nearest-neighbor relations. If we randomly
perturb the global graph structure, the ability to identify essential proteins drops, even though
the degree of connection at each vertex is unchanged.
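A degree-preserving randomization of the kind used for the control column can be sketched with repeated double-edge swaps. The original text does not specify the swap procedure or the number of swaps, so the details below are assumptions: the function name is hypothetical, the input is taken to be a simple graph (no self-loops or duplicate edges), and swaps that would create a self-loop or a duplicate edge are rejected.

```python
import random


def degree_preserving_shuffle(edges, n_swaps, seed=0):
    """Rewire an undirected simple graph while keeping every vertex degree.

    Repeatedly picks two edges (u, v) and (x, y) and replaces them with
    (u, y) and (x, v) when this creates no self-loop or duplicate edge.
    """
    rng = random.Random(seed)
    edges = [tuple(e) for e in edges]
    present = {frozenset(e) for e in edges}
    done = 0
    attempts = 0
    while done < n_swaps and attempts < 100 * n_swaps:
        attempts += 1
        a, b = rng.sample(range(len(edges)), 2)
        (u, v), (x, y) = edges[a], edges[b]
        if len({u, v, x, y}) < 4:
            continue  # shared endpoint: the swap would create a self-loop
        new_1, new_2 = frozenset((u, y)), frozenset((x, v))
        if new_1 in present or new_2 in present:
            continue  # the swap would duplicate an existing edge
        present -= {frozenset((u, v)), frozenset((x, y))}
        present |= {new_1, new_2}
        edges[a], edges[b] = (u, y), (x, v)
        done += 1
    return edges


# Example: shuffle a 6-cycle; every vertex keeps degree 2.
cycle = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5), (5, 0)]
shuffled = degree_preserving_shuffle(cycle, n_swaps=10, seed=1)
```

Recomputing v_i on the shuffled edge list and repeating the ranking then isolates the contribution of structure beyond the degree sequence.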

The 10 proteins with the highest v_i are listed in Table II. The full list of proteins with their
v_i can be found on the supplemental web site [20]. A selection of a few essential proteins with
high v_i but low k_i is also shown in Table III.

C. Distribution of β_ij

The interactions in the network can be grouped by the experimental methods used to
detect them. We score each interaction within the network by β_ij. The distribution of
log(β_ij) (Fig. 5) provides a mechanism to detect differences amongst different subsets of
interactions obtained by varied experimental methods. In Fig. 5, we compare the distribution
of log(β_ij) from the whole network to the distributions derived from several subsets of the
network. First, we use the core set of interactions derived by Deane et
al. [14]. Interactions in the core set are statistically verified to reduce the false positive rate,
yielding 1925 interactions (excluding self-interacting pairs). The distribution of log(β_ij) for
the core set is similar to that obtained for the entire network. However, upon comparing the
distributions of log(β_ij) for subsets of interactions obtained from different experimental
procedures, differences emerge. For example, interactions measured by immunoprecipitation
tend to have a larger β_ij, so that the distribution of log(β_ij) of this subset shifts to the right.
In contrast, the distribution for the subset of interactions measured with high-throughput
two-hybrid tests displays the opposite trend.

protein    v_i       k_i    viability
SRP1       0.0623    196    inviable
TEM1       0.0531    115    inviable
JSN1       0.0524    282    viable
YDL213C    0.0516    58     viable
CKA1       0.0513    65     viable
NUP116     0.0505    146    inviable
ERB1       0.0494    55     inviable
HHF1       0.0486    74     viable
NOP2       0.0479    48     inviable
CDC95      0.0475    48     viable

TABLE II: List of the 10 proteins with the highest v_i.

k_i    protein    v_i        k_i    protein    v_i
3      UTP8       0.0084     5      MAK11      0.0127
3      YKL088W    0.0081     5      BMS1       0.0124
3      DYS1       0.0075     5      YPR144C    0.0117
3      TRL1       0.0070     5      ACS2       0.0113
3      GRS1       0.0068     5      DIP2       0.0112
4      RLP24      0.0115     6      NOP14      0.0133
4      ROK1       0.0106     6      NOC3       0.0131
4      SPB4       0.0101     6      SEN1       0.0124
4      MES1       0.0094     6      YLL034C    0.0123
4      SEC18      0.00868    6      DIB1       0.0110

TABLE III: A selection of a few essential proteins with high v_i but low k_i.





FIG. 5: Normalized distributions of log(β_ij) for different subsets of interactions. The solid line
represents the distribution for all interactions in the data. The dotted line corresponds to the
core set extracted by Deane et al. [14]. The short dashed line refers to interactions obtained by
immunoprecipitation, and the long dashed line represents the subset of interactions derived from
high-throughput two-hybrid tests.

If e_ij is the only edge linking two clusters, the contribution of a particular realization
of the percolation procedure to β_ij is proportional to the product of the sizes of the two
clusters. Hence, an edge with a greater β_ij has a greater tendency to link two large modules
or clusters in the network. With this notion in mind, an examination of Fig. 5 suggests that
the IP method is possibly more sensitive to interactions between proteins in different large
modules, while the two-hybrid tests are better suited to detecting interactions which tend
not to link larger modules.

The discrepancy between the IP method and the two-hybrid tests might reflect the
underlying biochemical differences between the two methods. Unlike IP, the two-hybrid test is an
in vivo technique, and thus it can detect transient and unstable interactions [9]. Our analysis
of the distribution of log(β_ij) for the two-hybrid data provides quantitative evidence that
these transient and unstable interactions contribute less to the integrity of the interaction
network.

IV. CONCLUSION 

We presented a stochastic algorithm that explored the global connectivity properties of a
protein interaction network. This percolation-based algorithm allowed us to assign weights to
vertices and edges according to non-local topological properties. We applied the algorithm
to the protein interaction network for yeast and found that the percentage of essential
proteins correlated strongly with v_i. Importantly, the values of v_i, which incorporated the
knowledge of connections beyond the nearest neighbors, could more successfully discriminate
essential proteins than a method based solely on local connections. In addition, the essential
proteins with greater v_i not only possessed more interactions with other proteins in general
but also displayed more interactions with other essential proteins. This result suggested that
essential proteins, along with other proteins having greater v_i, might form a "core network"
with a higher density of interactions than the background network. If this
unverified hypothesis is confirmed, then we would gain significant insight into the evolution
of a protein interaction network. Are the proteins in this "core network" in general more
evolutionarily conserved than others? Hunter et al. claimed that there is a significant negative
correlation between each protein's degree of connectivity and its evolutionary rate, and
that evolutionary change may occur largely by coevolution [15]. If this is indeed so, we
expect a stronger correlation between v_i and protein evolutionary rate, since v_i provides a
better resolution than the degree of connectivity for proteins' positions in their interaction
network.

The β_ij scores for interactions could distinguish between different
experimental methods for measuring protein interactions. Such a quantitative measure of the
distinction amongst the experimental approaches will aid the interpretation of the proteomic
data.

In principle, c_ij can be calculated exactly given a percolation probability p. However,
this would require recursive iterations over all possible sub-graphs. Our stochastic approach
efficiently obtains approximations to the exact values of c_ij, v_i and β_ij. In this work, we
model the interaction network as a static graph with uniform weight on each edge. For a
biological system, dynamical aspects need to be incorporated. Various experimental methods
for probing the physical interactions between proteins respond differently to the dynamics of
biological systems. The two-hybrid test is more sensitive to transient interactions while the
IP method is more sensitive to large and stable protein complexes. These differences might
be addressed by considering different dynamical aspects of the interaction network.
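To make the cost of the exact calculation mentioned above concrete, the following sketch computes c_ij by brute-force enumeration of all 2^|E| edge subsets, each weighted by its probability. It is a plain enumeration chosen for illustration, not the recursive scheme alluded to in the text, and it is only feasible for graphs with a handful of edges; the function names are hypothetical.

```python
from itertools import product


def connected(n, edges, i, j):
    """Return True if vertices i and j are joined by a path (depth-first search)."""
    adjacency = {v: [] for v in range(n)}
    for a, b in edges:
        adjacency[a].append(b)
        adjacency[b].append(a)
    seen, stack = {i}, [i]
    while stack:
        v = stack.pop()
        for w in adjacency[v]:
            if w not in seen:
                seen.add(w)
                stack.append(w)
    return j in seen


def exact_correlation(n, edges, p, i, j):
    """Exact c_ij: sum of the probabilities of all sub-graphs joining i and j."""
    total = 0.0
    for keep in product((False, True), repeat=len(edges)):
        kept = [e for e, k in zip(edges, keep) if k]
        # Probability of this particular sub-graph.
        weight = p ** len(kept) * (1 - p) ** (len(edges) - len(kept))
        if connected(n, kept, i, j):
            total += weight
    return total


# Example: in a triangle, c_01 = p + (1 - p) * p**2.
triangle = [(0, 1), (1, 2), (0, 2)]
c_01 = exact_correlation(3, triangle, 0.5, 0, 1)
```

The number of terms doubles with every edge, so for a network with thousands of interactions the sampling approach is the only practical option.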

With regard to future pursuits, we note that it is also possible to use β_ij to cluster
vertices within a random graph. The β_ij score for a random graph is similar to the edge
"betweenness", defined as the number of shortest paths between all pairs of vertices passing
through a given edge. An edge with a greater β_ij is likely also an edge with a greater edge
"betweenness", because such an edge has a great tendency to bridge two different clusters or
modules. Clustering utilizing edge "betweenness" has been successfully applied to certain
types of random networks [16]. We expect that results similar to those shown in Fig. 1 could
be achieved with β_ij, not only for this small test graph but, more significantly, for larger
graphs in which the computational cost of calculating edge "betweenness" is prohibitive.
For the present, however, the idea of percolation on random networks provides a natural
mechanism for revealing dominant cluster structure within a graph. We hope such natural
cluster structure will provide further details about the protein interaction network.